168-31: Getting Your Random Sample in PROC SQL
نویسنده
چکیده
Proc SQL can be used to get a random sample from a large dataset with relative ease. A more common method of getting a random sample from a large dataset requires using the data step along with some programming or using the SURVEYSELECT procedure which became available in SAS/STAT beginning with SAS Version 8 ®. It is relatively easy to get a simple random sample using only the SQL procedure. INTRODUCTION There are times when a user may desire or need to work with a subset of a very large dataset and therefore may need to obtain a random sample of the larger dataset. The SURVEYSELECT procedure, introduced in SAS/STAT with SAS Version 8, is a procedure which provides a variety of methods for selecting simple random samples or random samples based on more complex multistage sample designs. Prior to the introduction of Proc SURVEYSELECT, selecting a random sample from a large dataset required programming using the DATA step. Another way to obtain a data subset which is a random sample of a large dataset is to use Proc SQL. While this method requires some basic knowledge of Proc SQL, it does not require any data step programming. The purpose of this poster/paper is to demonstrate the relative ease with which the user can obtain a simple random sample from a large dataset using only Proc SQL. RANDOM SAMPLES Ideally, random samples should be representative of the “population” from which they are randomly selected or drawn. For our purposes, the large dataset that we wish to sample from is the “population” and the dataset records are the observations. To obtain a simple random sample, we must select records from the population in a random manor such that each record has an equal chance of being selected, i.e. the last record should have the same chance of being selected as the first record or any of the records in between. For example: if we select a random sample from a dataset consisting of a company’s employee data in which 70% of the employees are female, then our random sample from this employee file should have approximately 70% females in it and every employee in the complete dataset should have had an equal chance of being selected. SAMPLE SIZE The size of a random sample can be a predetermined fixed number or it can be a percentage of the total number of observations or records in the large dataset. So, we can pull a random sample of 100,000 observations from the population dataset, assuming there are more than 100,000 observations to begin with, or we can pull a random sample consisting of 10% of all the observations in the large dataset. Methods of using Proc SQL to draw random samples for both scenarios will be presented. THE “POPULATION” DATA To demonstrate how to use Proc SQL to obtain a random sample from a large dataset, we will use a dataset that has one million records. The data in this ‘population’ dataset will have specific and known distributions so that we can show that the random samples we select are in fact representative of the population dataset. The data we will use for our large “population” dataset is a fictitious dataset created for illustrative purposes only. The dataset name is POP_DATASET and the data consists of the age and gender of one million people as well as two variables representing some characteristic or property of each record/person. The dataset has 5 variables: Record_no: this is a unique record/person identifier Norm01_rv: A random variable with a Standard Normal distribution (μ = 0, σ = 1) Uni01_rv: A random variable Uniformly distributed on the range (0,1) Sex: the gender of the person Age: the age in years of the person The distribution of the variable sex is 30 percent ‘Male’ and 70% ‘Female’. Posters SUGI 31
منابع مشابه
114-31: A Hands-On Tour Inside the World of PROC SQL
Structured Query Language (PROC SQL) is a database language found in the base-SAS software. It enables access to data stored in SAS data sets or tables using a powerful assortment of statements, clauses, options, functions, and other language features. This hands-on workshop presents core concepts of this powerful language as well as its many applications, and is intended for SAS users who desi...
متن کامل242-31: SQL Set Operators: So Handy Venn You Need Them
When it comes to combining data from multiple tables in the SQL Procedure, joins get most of the attention and subqueries are probably second. Often overlooked are the set operators (OUTER UNION, UNION, INTERSECT, and EXCEPT). This tutorial begins by relating OUTER UNION to similar functionality provided by the DATA step's SET statement, then explains and demonstrates the full repertoire of SET...
متن کاملRandom Effects Simulation for Sample Size Calculations Using SAS
Sample size calculations are a critical step in the planning of any experiment. In all but the simplest of experimental designs, closed-form equations are not readily available, and statisticians are required to use simulations to estimate an appropriate sample size for the experiment. Specifically, when multiple explanatory variables are thought to be predictive of the response or when missing...
متن کاملOld versus New: A Comparison of PROC LOGISTIC and PROC GLIMMIX
In the past, the SAS programming tools available for logistic regression problems have been trapped in a “fixed effects” modeling world. PROC LOGISTIC gives very few options when dealing with random effects, which has made the modeling of binary data from any kind of experimental design challenging at best. Such design elements as blocking or repeated measures are not readily analyzed using PRO...
متن کاملPaper Template
PROC SQL is SAS’ implementation of the ANSI Standard Query Language (SQL), an ANSI standard database programming/query language. Syntax is simple and its utility is broad in scope. As another capability, we find that PROC SQL is a very useful tool to perform exact test for monotonic trend in mean responses of discrete random variables. In this paper, we show utilization of PROC SQL to construct...
متن کامل